Both the data generation process and the missing data mechanism are
considered to be random processes, with joint probability distribution
given by

$$P(X, R \mid \theta, \phi) = P(X \mid \theta)\, P(R \mid X, \phi). \qquad (1)$$

We use the parameters $\theta$ for the data generation process and a
distinct set of parameters, $\phi$, for the missing data mechanism.

Once the data probability model has been decomposed as in (1), we can
distinguish three nested types of missing data mechanism. The data can be

1. Missing completely at random (MCAR): $P(R \mid X, \phi) = P(R \mid \phi)$.
That is, the probability that $x_{ij}$ is missing is independent of the
values in the data vector.

2. Missing at random (MAR): $P(R \mid X^o, X^m, \phi) = P(R \mid X^o, \phi)$.
That is, the probability that $x_{ij}$ is missing is independent of values
of missing components of the data, but may depend on the values of
observed components. For example, $x_{ij}$ may be missing for certain
values of $x_{ik}$ ($k \neq j$) provided that $x_{ik}$ is always observed.
Figure 1 illustrates this case.

3. Not missing at random (NMAR). That is, $P(R \mid X^o, X^m, \phi)$ may
depend on the value of $x_{ij}$. If $P(R_{ij} \mid x_{ij}, \phi)$ is a
function of $x_{ij}$ the data is said to be censored. For example, if a
sensor fails when its input exceeds some range its output will be
censored.

The type of the missing data mechanism is critical in evaluating learning
algorithms for handling incomplete data. Full maximum likelihood and
Bayesian approaches can handle data that is missing at random or
completely at random. Several simpler learning approaches can handle MCAR
data but fail on MAR data in general. No general approaches exist for
NMAR data.
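A small simulation can make the three mechanisms concrete. The sketch below is illustrative only: the two-variable Gaussian data, the missingness probabilities (0.5, 0.9, 0.1) and the name `simulate_masks` are assumptions introduced here, not part of the original text. It reports the mean of the second variable over the cases in which that variable is observed.

```python
import random

def simulate_masks(n=20000, seed=0):
    """Mean of x2 over observed cases under MCAR, MAR and NMAR missingness."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x1 = rng.gauss(0.0, 1.0)
        x2 = 0.8 * x1 + rng.gauss(0.0, 0.6)  # x2 is correlated with x1, true mean 0
        data.append((x1, x2))

    def observed_mean(p_missing):
        # Keep x2 with probability 1 - p_missing(x1, x2); x1 is always observed.
        kept = [x2 for (x1, x2) in data if rng.random() >= p_missing(x1, x2)]
        return sum(kept) / len(kept)

    return {
        # MCAR: missingness ignores the data entirely.
        "MCAR": observed_mean(lambda x1, x2: 0.5),
        # MAR: missingness of x2 depends only on the always-observed x1.
        "MAR": observed_mean(lambda x1, x2: 0.9 if x1 > 0 else 0.1),
        # NMAR: missingness of x2 depends on the value of x2 itself.
        "NMAR": observed_mean(lambda x1, x2: 0.9 if x2 > 0 else 0.1),
    }
```

In this toy setting the observed-case mean stays near the true value only under MCAR; under MAR and NMAR it is pulled toward negative values, which is one way to see why heuristics that simply discard incomplete cases can fail once the data is no longer MCAR.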

For both Bayesian and maximum likelihood techniques the estimates of the
parameters $\theta$ and $\phi$ are linked to the observed data, $X^o$ and
$R$, via $P(X^o, R \mid \theta, \phi)$.^1 For maximum likelihood methods
the likelihood is

$$L(\theta, \phi \mid X^o, R) \propto P(X^o, R \mid \theta, \phi),$$

and for Bayesian methods the posterior probability is

$$P(\theta, \phi \mid X^o, R) \propto P(X^o, R \mid \theta, \phi)\, P(\theta, \phi).$$

We wish to ascertain under which conditions the parameters of the data
generation process can be estimated independently of the parameters of
the missing data mechanism. Given that

$$P(X^o, R \mid \theta, \phi) = \int P(X^o, X^m \mid \theta)\, P(R \mid X^o, X^m, \phi)\, dX^m,$$

we note that if

$$P(R \mid X^o, X^m, \phi) = P(R \mid X^o, \phi),$$

then

$$P(X^o, R \mid \theta, \phi) = P(R \mid X^o, \phi) \int P(X^o, X^m \mid \theta)\, dX^m
  = P(R \mid X^o, \phi)\, P(X^o \mid \theta). \qquad (2)$$

^1 For succinctness we will use the non-Bayesian phrase "estimating
parameters" in this section; this can be replaced by "calculating the
posterior probabilities of the parameters" for the parallel Bayesian
argument.

Equation (2) states that if the data is MAR then the likelihood can be
factored. For maximum likelihood methods this implies directly that
maximizing $L(\theta \mid X^o) \propto P(X^o \mid \theta)$ as a function
of $\theta$ is equivalent to maximizing $L(\theta, \phi \mid X^o, R)$.
Therefore the parameters of the missing data mechanism can be ignored for
the purposes of estimating $\theta$ (Little and Rubin, 1987).

For Bayesian methods, the missing data mechanism cannot be ignored unless
we make the additional requirement that the prior is factorizable:

$$P(\theta, \phi) = P(\theta)\, P(\phi).$$

These  results  imply  that  data  sets  that  are  NMAR,  such 
as  censored  data,  cannot  be  handled  by  Bayesian  or 
likelihood-based  methods  unless  a  model  of  the  missing 
data  mechanism  is  also  learned.  On  the  positive  side, 
they  also  imply  that  the  MAR  condition,  which  is  weaker 
than  the  MCAR  condition,  is  sufficient  for  Bayesian  or 
likelihood-based  learning. 
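The role of the factorizable prior can be made explicit with one worked step. The lines below only combine the MAR factorization of the likelihood with the prior condition stated above; no new assumptions are introduced.

```latex
\begin{align*}
P(\theta,\phi \mid X^o, R)
  &\propto P(X^o, R \mid \theta,\phi)\, P(\theta,\phi) \\
  &= P(R \mid X^o,\phi)\, P(X^o \mid \theta)\, P(\theta)\, P(\phi)
     && \text{(MAR, factorizable prior)} \\
  &= \bigl[ P(X^o \mid \theta)\, P(\theta) \bigr]\,
     \bigl[ P(R \mid X^o,\phi)\, P(\phi) \bigr].
\end{align*}
```

The posterior therefore splits into a factor in $\theta$ and a factor in $\phi$, so integrating out $\phi$ leaves $P(\theta \mid X^o, R) \propto P(X^o \mid \theta)\, P(\theta)$. Without the factorizable prior the second line does not hold and the two parameter sets remain coupled.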

3  Likelihood-Based  Methods  for 
Feedforward  Networks 

In the previous section we showed that maximum likelihood methods can be
utilized for estimating the parameters of the data generation model,
ignoring the missing data mechanism, provided that the data is missing at
random. We now turn to the problem of estimating the parameters of a
model from incomplete data.

We focus first on feedforward neural network models before turning to a
class of models where the missing data can be incorporated more naturally
into the estimation algorithm. For feedforward neural networks we know
that descent in the error cost function can be interpreted as ascent in
the model parameter likelihood (e.g. White, 1989). In particular if the
target vector is assumed to be Gaussian,
$P(y_i \mid x_i, \theta) \sim N(y_i; f_\theta(x_i), \sigma_i^2 I)$, then
maximizing the log likelihood is equivalent to minimizing the sum-squared
error weighted by the output variances:

$$\max_\theta \sum_i \log P(y_i \mid x_i, \theta)
  \iff \min_\theta \frac{1}{2} \sum_i \frac{1}{\sigma_i^2}\, (y_i - f_\theta(x_i))^2 .$$

If a target is missing or unknown the variance of that output can be
taken to be infinite, $\sigma_i^2 \to \infty$. Similarly, if certain
components of a target vector are missing we can assume that the variance
of that component is infinite. The missing targets drop out of the
likelihood and the minimization can proceed as before, simply with
certain targets replaced by "don't cares."
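In code, the "don't care" treatment amounts to masking the residuals of missing targets so that they contribute neither to the error nor to its gradient. The following is a minimal sketch, assuming missing targets are coded as NaN and a single shared output variance `sigma2`; the function names are hypothetical.

```python
import numpy as np

def masked_sse(y, y_pred, sigma2=1.0):
    """Half sum-squared error over observed targets only (NaN = missing)."""
    y = np.asarray(y, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    observed = ~np.isnan(y)
    # Missing targets get a zero residual, as if their variance were infinite.
    resid = np.where(observed, y - y_pred, 0.0)
    return 0.5 * np.sum(resid ** 2) / sigma2

def masked_sse_grad(y, y_pred, sigma2=1.0):
    """Gradient of masked_sse with respect to the predictions."""
    y = np.asarray(y, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    resid = np.where(~np.isnan(y), y - y_pred, 0.0)
    return -resid / sigma2
```

For example, `masked_sse([1.0, float("nan"), 3.0], [0.0, 5.0, 1.0])` uses only the first and third components.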

If components of an input vector are missing, however, then the
likelihood is not properly defined since $P(y_i \mid x_i, \theta)$
depends on the full input vector. The conditional over the observed
inputs needed for the likelihood requires integrating out the missing
inputs. This, in turn, requires a model for the input density, $P(x)$,
which is not explicitly available in a feedforward neural network.

Tresp et al. (1994) proposed solving this problem by separately
estimating the input density, $P(x)$, with a Gaussian mixture model and
the conditional density, $P(y \mid x)$, with a feedforward network. This
approach can be seen as maximizing the joint input-output log likelihood

$$l = \sum_i \log P(x_i, y_i \mid \theta, \phi)
    = \sum_i \log P(y_i \mid x_i, \theta) + \sum_i \log P(x_i \mid \phi),$$

where the feedforward network is parametrized by $\theta$ and the mixture
model is parametrized by $\phi$. If some components of an input vector
are missing the observed data likelihood can be expressed as

$$P(y_i \mid x_i^o, \theta, \phi) = \int P(y_i \mid x_i^o, x^m, \theta)\, P(x^m \mid x_i^o, \phi)\, dx^m,$$

where the input has been decomposed into its observed and missing
components, $x = (x^o, x^m)$. The mixture model is used to integrate the
likelihood over the missing inputs of the feedforward network. The
gradient of this likelihood with respect to the network parameters,

$$\frac{\partial l}{\partial \theta} = \sum_i \frac{1}{\sigma_i^2}
  \int P(x^m \mid y_i, x_i^o, \theta, \phi)\,
  (y_i - f_\theta(x_i^o, x^m))\,
  \frac{\partial f_\theta(x_i^o, x^m)}{\partial \theta}\, dx^m,$$

exhibits error terms which weight each completion of the missing input
vector by $P(x^m \mid y_i, x_i^o, \theta, \phi)$. This term, by Bayes
rule, is proportional to the product of the probability of the completion
given the input, $P(x^m \mid x_i^o, \phi)$, and the posterior probability
of the output given that completion, $P(y_i \mid x_i^o, x^m, \theta)$.
The integral can be approximated by a Monte Carlo method, where, for each
missing input, several completions are generated according to the input
distribution. An intuitively appealing aspect of this method is that more
weight is placed on error gradients from input completions that better
approximate the target (Tresp et al., 1994; Buntine and Weigend, 1991).
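The weighting of completions can be sketched as follows. This is an illustration under stated assumptions rather than the cited authors' implementation: the output is taken to be scalar and Gaussian, the completions are assumed to have been drawn from the input model $P(x^m \mid x^o)$ already, and `completion_weights` is a hypothetical helper name.

```python
import math

def completion_weights(y, predictions, sigma2):
    """Normalized weights for sampled completions of a missing input.

    predictions[s] is the network output for the s-th completion, where the
    completions are assumed to be samples from P(x_m | x_o). With a Gaussian
    output model the weight of completion s is then proportional to
    P(y | x_o, x_m[s]), i.e. exp(-(y - f_s)^2 / (2 sigma2)).
    """
    log_w = [-0.5 * (y - f) ** 2 / sigma2 for f in predictions]
    top = max(log_w)  # subtract the maximum for numerical stability
    w = [math.exp(lw - top) for lw in log_w]
    total = sum(w)
    return [wi / total for wi in w]
```

A Monte Carlo gradient estimate would then sum the per-completion error gradients multiplied by these weights, so completions whose predictions lie closer to the target dominate the update.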

These arguments imply that computing maximum likelihood estimates from
missing inputs requires a model of the joint input density. In principle
this could be achieved by multiple feedforward networks each learning a
particular conditional density of inputs. For example, if the pattern of
missing inputs is monotone, i.e. the $d$ input dimensions can be ordered
such that if $x_{ij}$ is observed then all $x_{ik}$ for $k < j$ are also
observed, then the missing data can be completed by a cascade of $d - 1$
networks. Each network is trained to predict one input dimension from
completed instances of all the lower index input dimensions and therefore
models that particular conditional density (cf. regression imputation for
monotone multivariate normal data; Little and Rubin, 1987).
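The monotone condition is easy to check for a given ordering of the input dimensions. The helper below is a hypothetical illustration; it assumes the missingness pattern is given as rows of booleans (True = observed) with columns already arranged in the candidate order.

```python
def is_monotone(mask):
    """Return True if no observed entry follows a missing one in any row.

    mask[i][j] is True when x_ij is observed. The pattern is monotone for
    this column order if x_ij observed implies x_ik observed for all k < j.
    """
    for row in mask:
        seen_missing = False
        for observed in row:
            if observed and seen_missing:
                return False
            if not observed:
                seen_missing = True
    return True
```

A pattern that fails this test for every column ordering is the general case discussed next, where a cascade of networks is no longer sufficient.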

However, to accommodate general patterns of missing inputs and targets
the approach of using multiple feedforward networks becomes practically
cumbersome as the number of such networks grows exponentially with the
data dimensionality. This problem can be avoided by modeling both the
input and output densities using a mixture model.

4  Mixture  Models  and  Incomplete  Data 

The mixture modeling framework allows learning from data sets with
arbitrary patterns of incompleteness. Learning in this framework is a
classical estimation problem requiring an explicit probabilistic model
and an algorithm for estimating the parameters of the model. A possible
disadvantage of parametric methods is their lack of flexibility when
compared with nonparametric methods. Mixture models, however, largely
circumvent this problem as they combine much of the flexibility of
nonparametric methods with certain of the analytic advantages of
parametric methods (McLachlan and Basford, 1988).

Mixture models have been utilized recently for supervised learning
problems in the form of the "mixtures of experts" architecture (Jacobs et
al., 1991; Jordan and Jacobs, 1994). This architecture is a parametric
regression model with a modular structure similar to the nonparametric
decision tree and adaptive spline models (Breiman et al., 1984; Friedman,
1991). The approach presented here differs from these regression-based
approaches in that the goal of learning is to estimate the density of the
data. No distinction is made between input and output variables; the
joint density is estimated and this estimate is then used to form an
input/output map. Similar density estimation approaches have been
discussed by Specht (1991) for nonparametric models, and Nowlan (1991)
and Tresp et al. (1994), among others, for Gaussian mixture models. To
estimate the vector function $y = f(x)$ the joint density $P(x, y)$ is
estimated and, given a particular input $x$, the conditional density
$P(y \mid x)$ is formed. To obtain a single estimate of $y$ rather than
the full conditional density one can evaluate $\hat{y} = E(y \mid x)$,
the expectation of $y$ given $x$.

The most appealing feature of mixture models in the context of this paper
is that they can deal naturally with incomplete data. In fact, the
problem of estimating mixture densities can itself be viewed as a missing
data problem (the "labels" for the component densities are missing) and
an Expectation-Maximization (EM) algorithm (Dempster et al., 1977) can be
developed to handle both kinds of missing data.

4.1  The  EM  algorithm  for  mixture  models 

This section outlines the estimation algorithm for finding the maximum
likelihood parameters of a mixture model (Dempster et al., 1977). We
model the data $X = \{x_i\}_{i=1}^N$ as being generated independently
from a mixture density

$$P(x_i) = \sum_{j=1}^M P(x_i \mid \omega_j; \theta_j)\, P(\omega_j), \qquad (3)$$

where each component of the mixture is denoted $\omega_j$ and
parametrized by $\theta_j$. We start by assuming complete data. From
equation (3) and the independence assumption we see that the log
likelihood of the parameters given the data set is

$$l(\theta \mid X) = \sum_{i=1}^N \log \sum_{j=1}^M P(x_i \mid \omega_j; \theta_j)\, P(\omega_j).$$

By the maximum likelihood principle the best model of the data has
parameters that maximize $l(\theta \mid X)$. This function, however, is
not easily maximized numerically because it involves the log of a sum.
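For reference, the log likelihood itself is straightforward to evaluate even though it is awkward to maximize directly. The sketch below assumes one-dimensional Gaussian components purely to stay short, and uses the standard log-sum-exp device for the inner sum; the function names are introduced here for illustration.

```python
import math

def log_gauss(x, mean, var):
    """Log density of a one-dimensional Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def mixture_log_likelihood(data, priors, means, variances):
    """Sum over data points of the log of the mixture density."""
    total = 0.0
    for x in data:
        terms = [math.log(p) + log_gauss(x, m, v)
                 for p, m, v in zip(priors, means, variances)]
        top = max(terms)
        # log of a sum of exponentials, computed stably
        total += top + math.log(sum(math.exp(t - top) for t in terms))
    return total
```

The inner `log(sum(...))` is exactly the "log of a sum" that couples the parameters of all components and motivates the EM treatment that follows.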

Intuitively, there is a "credit-assignment" problem: it is not clear
which component of the mixture generated a given data point and thus
which parameters to adjust to fit that data point. The EM algorithm for
mixture models is an iterative method for solving this credit-assignment
problem. The intuition is that if one had access to a "hidden" random
variable $z$ indicating which data point was generated by which
component, then the overall maximization problem would decouple into a
set of simple maximizations. Using the binary indicator variables
$Z = \{z_i\}_{i=1}^N$, defined such that $z_i = (z_{i1}, \ldots, z_{iM})$
and $z_{ij} = 1$ iff $x_i$ is generated by Gaussian $j$, a
"complete-data" log likelihood function can be written

$$l_c(\theta \mid X, Z) = \sum_{i=1}^N \sum_{j=1}^M z_{ij} \log [P(x_i \mid z_i; \theta)\, P(z_i; \theta)], \qquad (4)$$

which does not involve a log of a summation.

Since $z$ is unknown $l_c$ cannot be utilized directly, so we instead
work with its expectation, denoted by $Q(\theta \mid \theta_k)$. As shown
by Dempster et al. (1977), $l(\theta \mid X)$ can be maximized by
iterating the following two steps:

E-step: $Q(\theta \mid \theta_k) = E[l_c(\theta \mid X, Z) \mid X, \theta_k]$

M-step: $\theta_{k+1} = \arg\max_\theta Q(\theta \mid \theta_k)$. (5)

The Expectation or E-step computes the expected complete data log
likelihood, and the Maximization or M-step finds the parameters that
maximize this likelihood. In practice, for densities from the exponential
family the E-step reduces to computing the expectation over the missing
data of the sufficient statistics required for the M-step. These two
steps form the basis of the EM algorithm for mixture modeling.

4.1.1  Incorporating  missing  values  into  the  EM 
algorithm 

In the previous section we presented one aspect of the EM algorithm:
learning mixture models. Another important application of EM is to
learning from data sets with missing values (Little and Rubin, 1987;
Dempster et al., 1977). This application has been pursued in the
statistics literature mostly for non-mixture density estimation
problems.^2 We now show how combining the missing data application of EM
with that of learning mixture parameters results in a set of clustering,
classification, and function approximation algorithms for incomplete
data.

^2 Some exceptions are the use of mixture densities in the context of
contaminated normal models for robust estimation (Little and Rubin,
1987), and in the context of mixed categorical and continuous data with
missing values (Little and Schluchter, 1985).

Using the previously defined notation, $x_i$ is divided into
$(x_i^o, x_i^m)$ where each data vector can have different patterns of
missing components. (To denote the missing and observed components in
each data vector we would ordinarily introduce superscripts $m_i$ and
$o_i$; however, we have simplified the notation for the sake of clarity.)

To handle missing data we rewrite the EM algorithm incorporating both the
indicator variables from algorithm (5) and the missing inputs,

E-step: $Q(\theta \mid \theta_k) = E[l_c(\theta \mid X^o, X^m, Z) \mid X^o, \theta_k]$

M-step: $\theta_{k+1} = \arg\max_\theta Q(\theta \mid \theta_k)$.

The expected value in the E-step is taken with respect to both sets of
missing variables. We proceed to illustrate this algorithm for two
classes of models, mixtures of Gaussians and mixtures of Bernoullis,
which we later use as building blocks for classification and function
approximation.


4.1.2  Real-valued data: mixture of Gaussians

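Real-valued data can be modeled as a mixture of Gaussians. We start with
the estimation algorithm for complete data (Duda and Hart, 1973; Dempster
et al., 1977; Nowlan, 1991). For this model the E-step simplifies to
computing $E[z_{ij} \mid x_i, \theta_k]$. Given the binary nature of
$z_{ij}$, $E[z_{ij} \mid x_i, \theta_k]$, which we denote by $h_{ij}$, is
the probability that Gaussian $j$ generated data point $i$.

$$h_{ij} = \frac{|\hat\Sigma_j|^{-1/2} \exp\{-\frac{1}{2}(x_i - \hat\mu_j)^T \hat\Sigma_j^{-1} (x_i - \hat\mu_j)\}}
               {\sum_{l=1}^M |\hat\Sigma_l|^{-1/2} \exp\{-\frac{1}{2}(x_i - \hat\mu_l)^T \hat\Sigma_l^{-1} (x_i - \hat\mu_l)\}} \qquad (6)$$

The M-step re-estimates the means and covariances of the Gaussians^3
using the data set weighted by the $h_{ij}$:

$$\hat\mu_j^{k+1} = \frac{\sum_{i=1}^N h_{ij}\, x_i}{\sum_{i=1}^N h_{ij}}, \qquad (7)$$

$$\hat\Sigma_j^{k+1} = \frac{\sum_{i=1}^N h_{ij}\, (x_i - \hat\mu_j^{k+1})(x_i - \hat\mu_j^{k+1})^T}{\sum_{i=1}^N h_{ij}}. \qquad (8)$$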
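One iteration of the complete-data updates (6) to (8) can be sketched in NumPy as below. This is an illustrative sketch, not the authors' code: it assumes equal priors for the Gaussians (as the derivation does), and the function name and array layout are choices made here.

```python
import numpy as np

def em_step(X, means, covs):
    """One EM iteration for a Gaussian mixture with equal priors.

    X: (N, n) data, means: (M, n), covs: (M, n, n) float arrays.
    Returns responsibilities h (N, M), updated means and covariances.
    """
    N, n = X.shape
    M = means.shape[0]
    log_h = np.empty((N, M))
    for j in range(M):
        diff = X - means[j]
        maha = np.einsum("ia,ab,ib->i", diff, np.linalg.inv(covs[j]), diff)
        log_h[:, j] = -0.5 * np.linalg.slogdet(covs[j])[1] - 0.5 * maha
    log_h -= log_h.max(axis=1, keepdims=True)   # stabilize before exponentiating
    h = np.exp(log_h)
    h /= h.sum(axis=1, keepdims=True)           # E-step, equation (6)

    new_means = (h.T @ X) / h.sum(axis=0)[:, None]          # equation (7)
    new_covs = np.empty_like(covs)
    for j in range(M):
        diff = X - new_means[j]
        # responsibility-weighted scatter matrix, equation (8)
        new_covs[j] = (h[:, j, None] * diff).T @ diff / h[:, j].sum()
    return h, new_means, new_covs
```

Calling `em_step` repeatedly, feeding the returned means and covariances back in, gives the basic EM loop; in practice a small ridge on the covariances may be needed to avoid singular matrices, which this sketch omits.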

To incorporate missing data we begin by rewriting the log likelihood of
the complete data,

$$l_c(\theta \mid X^o, X^m, Z) = \sum_{i=1}^N \sum_{j=1}^M z_{ij} \log P(x_i \mid z_i, \theta)
  + \sum_{i=1}^N \sum_{j=1}^M z_{ij} \log P(z_i \mid \theta). \qquad (9)$$

We can ignore the second term since we will only be estimating the
parameters of the $P(x_i \mid z_i, \theta)$. Specializing equation (9) to
the mixture of Gaussians we note that if only the indicator variables
$z_i$ are missing, the E-step can be reduced to estimating
$E[z_{ij} \mid x_i, \theta]$ as before. For the case we are interested
in, with both $z_i$ and $x_i^m$ missing, we expand equation (9) using $m$
and $o$ superscripts to denote subvectors and submatrices of the
parameters matching the missing and observed components of the data,^4 to
obtain

$$l_c(\theta \mid X^o, X^m, Z) = \sum_i \sum_j z_{ij} \Big[
  -\frac{n}{2} \log 2\pi - \frac{1}{2} \log |\Sigma_j|
  - \frac{1}{2} (x_i^o - \mu_j^o)^T \Sigma_j^{-1,oo} (x_i^o - \mu_j^o)$$
$$\qquad - (x_i^o - \mu_j^o)^T \Sigma_j^{-1,om} (x_i^m - \mu_j^m)
  - \frac{1}{2} (x_i^m - \mu_j^m)^T \Sigma_j^{-1,mm} (x_i^m - \mu_j^m) \Big],$$

where $n$ is the dimensionality of the data.

^3 Though this derivation assumes equal priors for the Gaussians, if the
priors are viewed as mixing parameters they can also be learned in the
maximization step.


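Note that after taking the expectation, the sufficient statistics for the
parameters include three unknown terms, $z_{ij}$, $z_{ij} x_i^m$, and
$z_{ij} x_i^m x_i^{mT}$. Thus we must compute:
$E[z_{ij} \mid x_i^o, \theta_k]$, $E[z_{ij} x_i^m \mid x_i^o, \theta_k]$,
and $E[z_{ij} x_i^m x_i^{mT} \mid x_i^o, \theta_k]$.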

One intuitive approach to dealing with missing data is to use the current
estimate of the data density to compute the expectation of the missing
data in an E-step, complete the data with these expectations, and then
use this completed data to re-estimate parameters in an M-step. However,
as we have seen in section 1, this intuition fails even when dealing with
a single two-dimensional Gaussian; the expectation of the missing data
always lies along a line, which biases the estimate of the covariance. On
the other hand, the approach arising from application of the EM algorithm
specifies that one should use the current density estimate to compute the
expectation of whatever incomplete terms appear in the likelihood
maximization. For the mixture of Gaussians these incomplete terms are the
interactions between the indicator variable $z_{ij}$ and the first and
second moments of $x_i^m$. Thus, simply computing the expectation of the
missing data $z_i$ and $x_i^m$ from the model and substituting those
values into the M-step is not sufficient to guarantee an increase in the
likelihood of the parameters.

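To compute the above expectations we define

$$\hat{x}_{ij}^m \equiv E[x_i^m \mid z_{ij} = 1, x_i^o, \theta_k]
  = \mu_j^m + \Sigma_j^{mo} \Sigma_j^{oo^{-1}} (x_i^o - \mu_j^o),$$

which is the least-squares linear regression between $x_i^m$ and $x_i^o$
predicted by Gaussian $j$. Then, the first expectation is
$E[z_{ij} \mid x_i^o, \theta_k] = h_{ij}$, the probability as defined in
(6) measured only on the observed dimensions of $x_i$. Similarly, we get

$$E[z_{ij} x_i^m \mid x_i^o, \theta_k] = h_{ij}\, \hat{x}_{ij}^m,$$

and

$$E[z_{ij} x_i^m x_i^{mT} \mid x_i^o, \theta_k] = h_{ij} \left[ \Sigma_j^{mm}
  - \Sigma_j^{mo} \Sigma_j^{oo^{-1}} \Sigma_j^{mo^T}
  + \hat{x}_{ij}^m \hat{x}_{ij}^{mT} \right]. \qquad (10)$$

^4 For example, $\Sigma$ is divided into
$\begin{bmatrix} \Sigma^{oo} & \Sigma^{om} \\ \Sigma^{mo} & \Sigma^{mm} \end{bmatrix}$
corresponding to $x = \begin{bmatrix} x^o \\ x^m \end{bmatrix}$. Also
note that the superscript $(-1, oo)$ denotes inverse followed by
submatrix operations, whereas $(oo^{-1})$ denotes the reverse order.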

The M-step uses these expectations substituted into equations (7) and (8)
to re-estimate the means and covariances. To re-estimate the mean vector,
we substitute the values of $\hat{x}_{ij}^m$ for the missing components
of $x_i$ in equation (7). To re-estimate the covariance matrix we
substitute the values of the bracketed term in (10) for the outer product
matrices involving the missing components of $x_i$ in equation (8).
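The three expectations can be computed per data vector as in the following NumPy sketch. It is an illustration under stated assumptions: missing entries are coded as NaN, the Gaussians have equal priors, each vector has at least one observed component, and the name `e_step_missing` and the return layout are choices made here rather than anything prescribed by the text.

```python
import numpy as np

def e_step_missing(x, means, covs):
    """Expected sufficient statistics for one data vector with NaN gaps.

    x: (n,), means: (M, n), covs: (M, n, n).
    Returns h (M,), completed vectors xhat (M, n), and
    hxx (M, n, n) = E[z_j x x^T | x_o] for every Gaussian j.
    """
    o = ~np.isnan(x)          # observed dimensions
    m = ~o                    # missing dimensions
    M, n = means.shape
    log_h = np.empty(M)
    xhat = np.tile(np.where(o, x, 0.0), (M, 1))
    cond_cov = np.zeros((M, n, n))
    for j in range(M):
        S_oo = covs[j][np.ix_(o, o)]
        S_mo = covs[j][np.ix_(m, o)]
        d = x[o] - means[j][o]
        sol = np.linalg.solve(S_oo, d)
        # responsibility measured on the observed dimensions only
        log_h[j] = -0.5 * np.linalg.slogdet(S_oo)[1] - 0.5 * d @ sol
        # regression of the missing on the observed components
        xhat[j, m] = means[j][m] + S_mo @ sol
        # residual covariance of the missing block, as in (10)
        cond_cov[j][np.ix_(m, m)] = (covs[j][np.ix_(m, m)]
                                     - S_mo @ np.linalg.solve(S_oo, S_mo.T))
    h = np.exp(log_h - log_h.max())
    h /= h.sum()
    hxx = h[:, None, None] * (cond_cov + xhat[:, :, None] * xhat[:, None, :])
    return h, xhat, hxx
```

Summing `h[j] * xhat[j]` and `hxx[j]` over the data set gives the numerators needed for the mean and (after subtracting the mean outer product) covariance updates. The `cond_cov` term is what distinguishes EM from simply filling in the conditional means: dropping it underestimates the covariance.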

4.1.3  Discrete-valued data: mixture of Bernoullis

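Binary data can be modeled as a mixture of Bernoulli densities. That is,
each $D$-dimensional vector $x = (x_1, \ldots, x_d, \ldots, x_D)$,
$x_d \in \{0, 1\}$, is modeled as generated from the mixture of $M$
Bernoulli densities:

$$P(x \mid \theta) = \sum_{j=1}^M P(\omega_j) \prod_{d=1}^D \mu_{jd}^{x_d} (1 - \mu_{jd})^{(1 - x_d)}.$$

For this model the complete data E-step computes

$$h_{ij} = \frac{\prod_{d=1}^D \hat\mu_{jd}^{x_{id}} (1 - \hat\mu_{jd})^{(1 - x_{id})}}
               {\sum_{l=1}^M \prod_{d=1}^D \hat\mu_{ld}^{x_{id}} (1 - \hat\mu_{ld})^{(1 - x_{id})}}, \qquad (11)$$

and the M-step re-estimates the parameters by

$$\hat\mu_j^{k+1} = \frac{\sum_{i=1}^N h_{ij}\, x_i}{\sum_{i=1}^N h_{ij}}. \qquad (12)$$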

As before, to incorporate missing data we must compute the appropriate
expectations of the sufficient statistics in the E-step. For the
Bernoulli mixture these include the incomplete terms
$E[z_{ij} \mid x_i^o, \theta_k]$ and
$E[z_{ij} x_i^m \mid x_i^o, \theta_k]$. The first is equal to $h_{ij}$
calculated over the observed subvector of $x_i$. The second, since we
assume that within a class the individual dimensions of the Bernoulli
variable are independent, is simply $h_{ij} \hat\mu_j^m$. The M-step uses
these expectations substituted into equation (12).
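For the Bernoulli mixture the E-step with missing bits is short enough to write without any matrix algebra. In this sketch missing bits are coded as `None`, and the component priors are passed explicitly (with equal priors they cancel); the function name and the return format are assumptions made for illustration.

```python
def bernoulli_e_step(x, priors, mu):
    """E-step for one binary vector with missing bits.

    x: list of 0, 1 or None (missing); mu[j][d] = P(x_d = 1 | component j).
    Returns (h, completed) where h[j] is the responsibility computed over
    the observed bits only and completed[j][d] = E[z_j x_d | observed bits].
    """
    lik = []
    for p, mu_j in zip(priors, mu):
        value = p
        for xd, m in zip(x, mu_j):
            if xd is not None:                 # missing bits are skipped
                value *= m if xd == 1 else 1.0 - m
        lik.append(value)
    total = sum(lik)
    h = [v / total for v in lik]
    # Observed bits keep their value; missing bits are replaced by mu_jd.
    completed = [[hj * (m if xd is None else xd) for xd, m in zip(x, mu_j)]
                 for hj, mu_j in zip(h, mu)]
    return h, completed
```

Accumulating `completed[j]` and `h[j]` over the data set and dividing gives the M-step update of equation (12).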

More generally, discrete or categorical data can be modeled as generated
by a mixture of multinomial densities and similar derivations for the
learning algorithm can be applied. Finally, the extension to data with
mixed real, binary, and categorical dimensions can be readily derived by
assuming a joint density with mixed components of the three types. Such
mixed models can serve to solve classification problems, as will be
discussed in a later section.


4.2  Clustering 

Gaussian mixture model estimation is a form of soft clustering (Nowlan,
1991). Furthermore, if a full covariance model is used the principal axes
of the Gaussians align with the principal components of the data within
each soft cluster. For binary or categorical data soft clustering
algorithms can also be obtained using the above Bernoulli and multinomial
mixture models. We illustrate the extension of these clustering
algorithms to missing data problems with a simple example from character
recognition.


Figure 2: Learning digit patterns. First row: the ten 5 × 7 templates
used to generate the data set. Second row: templates with Gaussian noise
added. Third row: templates with noise added and 50% missing pixels. The
training set consisted of ten such noisy, incomplete samples of each
digit. Fourth row: means of the twelve Gaussians at asymptote (~30 passes
through the data set of 100 patterns) using the mean imputation
heuristic. Fifth row: means of the twelve Gaussians at asymptote (~60
passes, same incomplete data set) using the EM algorithm. Gaussians
constrained to diagonal covariance matrices.


In this example (Fig. 2), the Gaussian mixture algorithm was used on a
training set of 100 35-dimensional noisy greyscale digits with 50% of the
pixels missing. The EM algorithm approximated the cluster means from this
highly deficient data set quite well. We compared EM to mean imputation,
a common heuristic where the missing values are replaced with their
unconditional means. The results showed that EM outperformed mean
imputation when measured both by the distance between the Gaussian means
and the templates (see Fig. 2), and by the likelihoods (log likelihoods
± 1 s.e.: EM −4633 ± 328; mean imputation −10062 ± 1263; n = 5).

4.3  Function  approximation 

So far, we have alluded to data vectors with no reference to "inputs" and
"targets." In supervised learning, however, we generally wish to predict
target variables from some set of input variables; that is, we wish to
approximate a function relating these two sets of variables. If we
decompose each data vector $x_i$ into an "input" subvector, $x_i^i$, and
a "target" or output subvector, $x_i^o$, then the relation between input
and target variables can be expressed through the conditional density
$P(x_i^o \mid x_i^i)$. This conditional density can be readily obtained
from the joint input/target density, which is the density which all the
above mixture models seek to estimate. Thus, in this framework, the
distinction between supervised learning, i.e. function approximation, and
unsupervised learning, i.e. density estimation, is semantic, resulting
from whether the data is considered to be composed of inputs and targets
or not.

Focusing on the Gaussian mixture model we note that the conditional
density $P(x_i^o \mid x_i^i)$ is also a Gaussian mixture. Given a
particular input the estimated output should summarize this density.

If we require a single estimate of the output, a natural candidate is the
least squares estimate (LSE), which takes the form
$\hat{x}^o(x_i^i) = E(x_i^o \mid x_i^i)$. Expanding the expectation we
get

$$\hat{x}^o(x_i^i) = \sum_{j=1}^M h_{ij} \left[ \mu_j^o + \Sigma_j^{oi} \Sigma_j^{ii^{-1}} (x_i^i - \mu_j^i) \right], \qquad (13)$$

which is a convex sum of the least squares linear approximations given by
each Gaussian. The weights in the sum, $h_{ij}$, vary nonlinearly over
the input space and can be viewed as corresponding to the output of a
classifier that assigns to each point in the input space a probability of
belonging to each Gaussian.^5 The least squares estimator has interesting
relations to models such as CART (Breiman et al., 1984), MARS (Friedman,
1991), and mixtures of experts (Jacobs et al., 1991; Jordan and Jacobs,
1994), in that the mixture of Gaussians competitively partitions the
input space, and learns a linear regression surface on each partition.
This similarity has also been noted by Tresp et al. (1994).

If the Gaussian covariance matrices are constrained to be diagonal, the
least squares estimate further simplifies to

$$\hat{x}^o(x_i^i) = \sum_{j=1}^M h_{ij}\, \mu_j^o,$$

the average of the output means, weighted by the proximity of $x_i^i$ to
the Gaussian input means. This expression has a form identical to
normalized radial basis function (RBF) networks (Moody and Darken, 1989;
Poggio and Girosi, 1989), although the two algorithms are derived from
disparate frameworks. In the limit, as the covariance matrices of the
Gaussians approach zero, the approximation becomes a nearest-neighbor
map.

^5 The $h_{ij}$ in equation (13) are computed by substituting $x_i^i$
into equation (6) and evaluating the exponentials over the dimensions of
the input space.
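The least squares estimate of equation (13) can be sketched as a small prediction routine. The assumptions here are labeled: the first `n_in` dimensions of each Gaussian are treated as inputs and the remaining ones as outputs (an ordering chosen only for convenience), priors are equal, and `lse_predict` is a hypothetical name.

```python
import numpy as np

def lse_predict(x_in, means, covs, n_in):
    """Least squares estimate of the outputs given the inputs.

    x_in: (n_in,), means: (M, n), covs: (M, n, n). Each Gaussian contributes
    its own linear regression of outputs on inputs; the contributions are
    averaged with the responsibilities computed on the input dimensions.
    """
    M = means.shape[0]
    log_h = np.empty(M)
    preds = np.empty((M, means.shape[1] - n_in))
    for j in range(M):
        S_ii = covs[j][:n_in, :n_in]
        S_oi = covs[j][n_in:, :n_in]
        d = x_in - means[j][:n_in]
        sol = np.linalg.solve(S_ii, d)
        log_h[j] = -0.5 * np.linalg.slogdet(S_ii)[1] - 0.5 * d @ sol
        preds[j] = means[j][n_in:] + S_oi @ sol   # per-Gaussian regression
    h = np.exp(log_h - log_h.max())
    h /= h.sum()
    return h @ preds
```

With diagonal covariances the cross term `S_oi` vanishes and the routine returns the responsibility-weighted average of the output means, which is the normalized RBF form described above.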

Not all learning problems lend themselves to least squares estimates:
many problems involve learning a one-to-many mapping between the input
and target variables (Ghahramani, 1994). The resulting conditional
densities are multimodal and no single value of the output given the
input will appropriately reflect this fact (Shizawa, 1993; Ghahramani,
1994; Bishop, 1994). For such problems a stochastic estimator, where the
output is sampled according to
$\hat{x}^o(x_i^i) \sim P(x_i^o \mid x_i^i)$, is to be preferred to the
least squares estimator.

For learning problems involving discrete variables the LSE and stochastic
estimators have a different interpretation. If we wish to obtain the
posterior probability of the output given the input we would use the LSE.
On the other hand, if we wish to obtain output estimates that fall in our
discrete output space we would use the stochastic estimator.

4.4  Classification 


[Figure 3 plot. Title: Classification with missing inputs; horizontal
axis: % missing features.]

Figure 3: Classification of the iris data set. 100 data points were used
for training and 50 for testing. Each data point consisted of 4
real-valued attributes and one of three class labels. The figure shows
classification performance ± 1 standard error (n = 5) as a function of
proportion missing features for the EM algorithm and for mean imputation
(MI), a common heuristic where the missing values are replaced with their
unconditional means.

Classification, though strictly speaking a special case of function
approximation, merits attention of its own. Classification problems
involve learning a mapping from an input space of attributes into a set
of discrete class labels. The mixture modeling framework presented here
lends itself readily to classification problems by modeling the class
label as a multinomial variable. For example, if the attributes are
real-valued and there are $D$ class labels, a mixture model with Gaussian
and multinomial components can be used;

$$P(x, C = d \mid \theta) = \sum_{j=1}^M P(\omega_j)\,
  \frac{\mu_{jd}}{(2\pi)^{n/2} |\Sigma_j|^{1/2}}
  \exp\{-\tfrac{1}{2} (x - \mu_j)^T \Sigma_j^{-1} (x - \mu_j)\}$$

denotes the joint probability that the data point has attributes $x$ and
belongs to class $d$, where the $\mu_{jd}$ are the parameters for the
multinomial class variable. That is,
$\mu_{jd} = P(C = d \mid \omega_j, \theta)$, and
$\sum_{d=1}^D \mu_{jd} = 1$.

Missing attributes and missing class labels (i.e., unlabeled data points)
are readily handled via the EM algorithm. In the E-step, missing
attributes are completed using the same formulas as for the Gaussian
mixture except that

$$h_{ij} = \frac{\mu_{jd}\, P(x_i^o \mid \omega_j, \theta)}
               {\sum_{l=1}^M \mu_{ld}\, P(x_i^o \mid \omega_l, \theta)},$$

where $d$ is the observed class label $C_i$ of data point $i$. On the
other hand, if a class label is missing $h_{ij}$ becomes
$P(x_i \mid \omega_j, \theta) / \sum_{l=1}^M P(x_i \mid \omega_l, \theta)$,
exactly as in the Gaussian mixture. The class label is then completed
with a probability vector whose $d$-th component is $h_{ij} \mu_{jd}$.

Once the classification model has been estimated, the most likely label
for a particular input $x$ may be obtained by computing
$P(C = d \mid x, \theta)$. Similarly, the class conditional densities can
be computed by evaluating $P(x \mid C = d, \theta)$. Conditionalizing
over classes in this way yields class conditional densities which are in
turn mixtures of Gaussians. Figure 3 shows the performance of the EM
algorithm on a sample classification problem with varying proportions of
missing features.
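Given the per-component likelihoods of the observed attributes, both the label-aware responsibilities and the class posterior reduce to a few sums. The sketch below assumes equal component priors and takes the likelihoods $P(x^o \mid \omega_j, \theta)$ as precomputed inputs; both function names are hypothetical.

```python
def responsibilities_with_label(lik, mu, d):
    """h_j for a data point whose class label d is observed.

    lik[j] = P(x_o | component j); mu[j][d] = P(C = d | component j).
    """
    weighted = [mu_j[d] * l for l, mu_j in zip(lik, mu)]
    total = sum(weighted)
    return [w / total for w in weighted]

def class_posterior(lik, mu):
    """P(C = d | x) for every class d when the label is unobserved."""
    total = sum(lik)
    h = [l / total for l in lik]              # responsibilities from x alone
    n_classes = len(mu[0])
    return [sum(hj * mu_j[d] for hj, mu_j in zip(h, mu))
            for d in range(n_classes)]
```

The most likely label is then the index of the largest entry of `class_posterior(lik, mu)`.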

This mixture-based approach to classification is closely related to the
mixture discriminant analysis (MDA) approach recently proposed by Hastie
and Tibshirani (1994). In MDA, classes are also fit by mixture densities
using the EM algorithm and an optimal discriminant is obtained. Hastie
and Tibshirani extend this basic MDA procedure by combining it with
reduced rank discrimination. Like Fisher-Rao linear discriminant analysis
this results in an interpretable, low dimensional projection of the data
and often also leads to improved classification performance. While the
authors do not mention missing data, it seems likely that EM methods can
be used in the context of their algorithm.

Previous  approaches  to  classification  from  incomplete 
patterns have proceeded along different lines. Cheeseman
et al. (1988) describe a Bayesian classification
method in which each class is modeled as having Gaussian
real-valued attributes and multinomial discrete attributes.
The learning procedure finds the maximum a
posteriori parameters of the model by differentiating the
posterior probability of the class parameters and setting
the derivative to zero. This yields a coupled set of nonlinear equations,
similar to the EM steps, which can be iterated to find
the posterior mode of the parameters (Dempster et al.,
1977). To handle missing data the authors state that
"for discrete attributes it can be shown that the correct
procedure for treating an unknown value is equivalent to
adding an 'unknown' category to the value set" (p. 62).
For real-valued attributes they add a 'known'/'unknown'
category to each attribute and set its value appropriately.
Three comments can be made about this approach.
First, each 'unknown' category added to the
multinomial value set results in an extra parameter that
has to be estimated. Furthermore, adding an 'unknown'
category does not reflect the fact that the unobserved
data actually arises from the original multinomial value
set (an argument also made by Quinlan, 1986; see below).
For example, for a data set in which one attribute
is often unknown the algorithm may form a class based
on that attribute taking on the value 'unknown', a situation
which is clearly undesirable in a classifier. Finally,
as  each  class  is  modeled  by  a  single  Gaussian  or  multino¬ 
mial  and  the  data  points  are  assumed  to  be  unlabeled, 
the  Cheeseman  et  al.  (1988)  algorithm  is  in  fact  a  form 
of  soft  clustering. 

Southcott  and  Bogner  (1993)  have  approached  the 
problem  of  classification  of  incomplete  data  using  an  ap¬ 
proximation  to  EM  for  clustering.  In  the  E-step,  the 
observed  data  are  classified  using  the  current  mixture 
model,  and  each  data  point  is  assigned  to  its  most  likely 
class.  The  parameters  of  each  class  are  then  re-estimated 
in the M-step. In our notation this approximation corresponds
to setting the highest $h_{ij}$ for each data point to
1 and all the others to 0. They compared this method
with  a  neural  network  based  algorithm  in  which  each 
missing  input  is  varied  through  the  possible  range  of 
(discrete)  attribute  values  to  find  the  completion  result¬ 
ing  in  minimum  classification  error.  They  reported  that 
their  approximation  to  EM  outperformed  both  the  neu¬ 
ral  network  algorithm  and  an  algorithm  based  on  linear 
discriminant  analysis.  They  did  not  include  the  exact 
EM  algorithm  in  their  comparison. 
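
The hard-assignment approximation described above amounts to replacing each row of soft posteriors by a one-hot vector. The following minimal sketch illustrates only that step; the function name `harden` is hypothetical, and this is not Southcott and Bogner's code.

```python
import numpy as np

def harden(h):
    """Replace each row of soft posteriors h (N, M) by a one-hot vector at its maximum."""
    hard = np.zeros_like(h)
    hard[np.arange(len(h)), h.argmax(axis=1)] = 1.0
    return hard
```

Feeding the hardened matrix to the usual M-step in place of the soft posteriors gives a k-means-like update in which each data point contributes to exactly one class.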

Quinlan (1986, 1989) discusses the problem of missing
data  in  the  context  of  decision  tree  classifiers.  Quinlan’s 
decision  tree  framework  uses  a  measure  of  information 
gain  to  build  a  classifier,  resulting  in  a  tree  structure 
of  queries  on  attribute  values  and  a  set  of  leaves  rep¬ 
resenting  class  membership.  The  author  concludes  that 
treating 'unknown' as a separate value is not a good solution
to the missing value problem, as querying on attributes
with unknown values will have higher apparent
information  gain  (Quinlan,  1986).  The  approach  that 
he  advocates  instead  is  to  compute  the  expected  infor¬ 
mation  gain,  by  assuming  that  the  unknown  attribute  is 
distributed  according  to  the  observed  values  in  the  sub¬ 
set  of  the  data  at  that  node  of  the  tree.  This  approach 
is  consistent  with  the  information  theoretic  framework 
adopted  in  his  work  and  parallels  the  EM  and  Bayesian 
treatments  of  missing  data  which  suggest  integrating 
over  the  possible  missing  values. 

An  alternative  method  of  handling  missing  data  in  de¬ 
cision  trees  is  presented  by  Breiman  et  al.  (1984)  for  the 
CART algorithm. CART initially constructs a large decision
tree based on a splitting criterion closely related to
the  above  measure  of  information  gain.  The  tree  is  then 
pruned  recursively  using  a  measure  of  model  complexity 
proportional  to  the  number  of  terminal  nodes,  resulting 
in  a  smaller,  more  interpretable  tree  with  better  gener¬ 
alization  properties.  If  a  case  is  missing  the  value  of  an 
attribute  then  it  is  not  considered  when  evaluating  the 
goodness  of  splits  on  that  attribute.  Cases  are  assigned 
to  branches  of  a  split  on  an  attribute  where  they  have 
missing values using the best 'surrogate split', i.e. the
split  on  another  attribute  which  partitions  the  data  most 
similarly  to  the  original  split.  This  method  works  well 
when  there  is  a  single,  highly  correlated  attribute  that 
predicts  the  effects  of  a  split  along  the  missing  attribute. 
However,  if  no  single  attribute  can  predict  the  effects  of 
the  split  this  method  may  not  perform  well.  An  approach 
based  on  computing  the  expected  split  from  all  the  ob¬ 
served  variables,  similar  to  Quinlan’s,  would  be  more 
suitable  from  a  statistical  perspective  and  may  provide 
improved  performance  with  missing  data. 


In Bayesian learning the parameters are treated as unknown
random variables characterized by a probability
distribution.  Bayesian  learning  utilizes  a  prior  distribu¬ 
tion  for  the  parameters,  which  may  encode  world  knowl¬ 
edge,  initial  biases  of  the  learner,  or  constraints  on  the 
probable  parameter  values.  Learning  proceeds  by  Bayes’ 
rule — multiplying  the  prior  probability  of  the  parameters 
by  the  likelihood  of  the  data  given  the  parameters,  and 
normalizing  by  the  integral  over  the  parameter  space — 
resulting  in  a  posterior  distribution  of  the  parameters. 
The  information  learned  about  the  unknown  parameters 
is  expressed  in  the  form  of  this  posterior  probability  dis¬ 
tribution. 

In  the  context  of  learning  from  incomplete  data,  the 
Bayesian  use  of  priors  can  have  impact  in  two  arenas. 
First,  the  prior  may  reflect  assumptions  about  the  initial 
distribution  of  parameter  values  as  described  above.  The 
learning  procedure  converts  this  prior  into  a  posterior  via 
the  data  likelihood.  We  have  seen  that  to  perform  this 
conversion  independently  of  the  missing  data  mechanism 
requires  both  that  the  mechanism  be  missing  at  random 
and  that  the  prior  be  factorizable.  Second,  the  prior 
may  reflect  assumptions  about  the  initial  distribution  of 
the  missing  values.  Thus,  if  we  have  a  prior  distribution 
for  input  values  we  can  complete  the  missing  data  by 
sampling from this distribution.

For  complete  data  problems  and  simple  models  the 
judicious  choice  of  conjugate  priors  for  the  parameters 
often  allows  analytic  computation  of  their  posterior  dis¬ 
tribution  (Box  and  Tiao,  1973).  However,  in  incomplete 
data  problems  the  usual  choices  of  conjugate  priors  do 
not  generally  lead  to  recognizable  posteriors,  making  it¬ 
erative  simulation  and  sampling  techniques  for  obtaining 
the  posterior  distribution  indispensable  (Schafer,  1994). 

5.1  Data  augmentation  and  Gibbs  sampling 

One  such  technique,  which  is  closely  related  in  form  to 
the  EM  algorithm,  is  data  augmentation  (Tanner  and 
Wong,  1987).  This  iterative  algorithm  consists  of  two 
steps. In the Imputation or I-step, instead of computing
the expectations of the missing sufficient statistics,
we simulate $m$ random draws of the missing data from
their conditional distribution $P(X^m \mid X^o, \theta)$. In the Posterior
or P-step we sample $m$ times from the posterior
distribution of the parameters, which can now be more
easily computed with the imputed data: $P(\theta \mid X^o, X^m)$.
Thus, we obtain samples from the joint distribution
$P(X^m, \theta \mid X^o)$ by alternately conditioning on one or the
other  of  the  unknown  variables,  a  technique  known  as 
Gibbs  sampling  (Geman  and  Geman,  1984).  Under  some 
mild  regularity  conditions  this  algorithm  can  be  shown 
to  converge  in  distribution  to  the  posterior  (Tanner  and 
Wong,  1987).  Note  that  the  augmented  data  can  be  cho¬ 
sen  so  as  to  simplify  the  P-step  in  much  the  same  way  as 
indicator  variables  can  be  chosen  to  simplify  the  M-step 
in  EM. 
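
To make the I-step and P-step concrete, here is a small sketch for a deliberately simple model that is not taken from the text: a bivariate Gaussian with known covariance `Sigma`, an unknown mean with a flat prior, and some missing second coordinates. The function name, the use of `np.nan` to mark missing entries, and the choice of a single draw per iteration ($m = 1$) are all assumptions made for illustration.

```python
import numpy as np

def data_augmentation(X, Sigma, n_iter=2000, seed=0):
    """Data augmentation for the mean of a bivariate Gaussian with known covariance.

    X     : (N, 2) array; np.nan marks a missing second coordinate
    Sigma : (2, 2) known covariance matrix
    Returns an (n_iter, 2) array of draws of the mean.
    """
    rng = np.random.default_rng(seed)
    X = X.copy()
    miss = np.isnan(X[:, 1])
    N = len(X)
    mu = np.array([X[:, 0].mean(), np.nanmean(X[:, 1])])  # crude starting value
    slope = Sigma[0, 1] / Sigma[0, 0]
    cond_sd = np.sqrt(Sigma[1, 1] - Sigma[0, 1] ** 2 / Sigma[0, 0])
    draws = np.empty((n_iter, 2))
    for t in range(n_iter):
        # I-step: draw each missing x2 from P(x2 | x1, mu)
        cond_mean = mu[1] + slope * (X[miss, 0] - mu[0])
        X[miss, 1] = cond_mean + cond_sd * rng.standard_normal(miss.sum())
        # P-step: draw mu from its complete-data posterior N(xbar, Sigma / N)
        mu = rng.multivariate_normal(X.mean(axis=0), Sigma / N)
        draws[t] = mu
    return draws
```

After discarding an initial burn-in, the rows of the returned array can be treated as approximate samples from the posterior of the mean given only the observed data.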

Data augmentation techniques have been recently combined with
the Metropolis-Hastings algorithm (Schafer, 1994). In
Metropolis-Hastings (Metropolis et al., 1953; Hastings,
1970),  one  creates  a  Monte  Carlo  Markov  chain  by  draw¬ 
ing  from  a  probability  distribution  meant  to  approximate 
the  distribution  of  interest  and  accepting  or  rejecting  the 
drawn  value  based  on  an  acceptance  ratio.  The  accep¬ 
tance  ratio,  e.g.  the  ratio  of  probabilities  of  the  drawn 
state  and  the  previous  state,  can  often  be  chosen  to  be 
easy  to  calculate  as  it  does  not  involve  computation  of 
the  normalization  factor.  If  the  transition  probabilities 
allow  any  state  to  be  reached  eventually  from  any  other 
state  (i.e.  the  chain  is  ergodic)  then  the  Markov  chain 
will  approach  its  stationary  distribution,  chosen  to  be 
the  distribution  of  interest,  from  any  initial  distribution. 
The combination of data augmentation and Metropolis-Hastings
can be used, for example, in problems where
the posterior itself is difficult to sample from in the P-step.
For such problems one may generate a Markov
chain whose stationary distribution is $P(\theta \mid X^o, X^m)$.
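
The accept/reject mechanism can be illustrated with a random-walk Metropolis sampler for a one-dimensional target, the simplest special case of Metropolis-Hastings (the proposal is symmetric, so the acceptance ratio is just the ratio of target probabilities). This is a generic sketch rather than the scheme of Schafer (1994); the function name and the Gaussian proposal with width `step` are assumptions.

```python
import numpy as np

def metropolis(log_target, x0, n_samples=5000, step=1.0, seed=0):
    """Random-walk Metropolis sampler for an unnormalized 1-D log density."""
    rng = np.random.default_rng(seed)
    x, lp = x0, log_target(x0)
    samples = np.empty(n_samples)
    for t in range(n_samples):
        prop = x + step * rng.standard_normal()   # symmetric proposal
        lp_prop = log_target(prop)
        # Accept with probability min(1, p(prop) / p(x)); the normalizer cancels.
        if np.log(rng.uniform()) < lp_prop - lp:
            x, lp = prop, lp_prop
        samples[t] = x                            # a rejected move repeats the old state
    return samples
```

Because only the difference of log densities enters the test, `log_target` may omit the normalization constant, which is the property the text highlights.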

5.2 Multiple imputation and Bayesian backpropagation

Multiple  imputation  (Rubin,  1987)  is  a  technique  in 
which  each  missing  value  is  replaced  by  m  simulated  val¬ 
ues  which  reflect  uncertainty  about  the  true  value  of  the 
missing  data.  After  multiple  imputation,  m  completed 
data  sets  exist,  each  of  which  can  be  analyzed  using  com¬ 
plete  data  methods.  The  results  can  then  be  combined 
to  form  a  single  inference.  Though  multiple  imputation 
requires sampling from $P(X^m \mid X^o, \theta)$, which may be difficult,
iterative simulation methods can also be used in
this  context  (Schafer,  1994). 
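
For a scalar quantity, the step of combining the $m$ complete-data analyses into a single inference is commonly done with the combining rules associated with Rubin (1987): average the point estimates, and add the average within-imputation variance to an inflated between-imputation variance. The sketch below assumes each analysis returns a point estimate and its variance; the function name is hypothetical.

```python
import numpy as np

def combine_imputations(estimates, variances):
    """Pool m complete-data estimates of a scalar and their variances."""
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = len(q)
    q_bar = q.mean()                 # pooled point estimate
    within = u.mean()                # average within-imputation variance
    between = q.var(ddof=1)          # between-imputation variance
    total = within + (1.0 + 1.0 / m) * between
    return q_bar, total
```

The between-imputation term is what carries the extra uncertainty due to the missing values; with a single imputation it could not be estimated at all.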

The  Bayesian  backpropagation  technique  for  missing 
data  presented  by  Buntine  and  Weigend  (1991)  is  a  spe¬ 
cial  case  of  multiple  imputation.  In  Bayesian  backpropa¬ 
gation,  multiple  values  of  the  input  are  imputed  accord¬ 
ing  to  a  prior  distribution  so  as  to  approximate  the  inte¬ 
gral  in  (3),  which  in  turn  is  used  to  compute  the  gradient 
required  for  backpropagation.  This  procedure  is  similar 
to  that  of  Tresp  et  al.  (1994),  except  that  whereas  the 
former completes the data by sampling from a prior distribution
of inputs, the latter estimates this distribution
directly from the data.*

6 Boltzmann machines and incomplete data

Boltzmann  machines  are  networks  of  binary  stochastic 
units  with  symmetric  connections,  in  which  learning  cor¬ 
responds  to  minimizing  the  relative  entropy  between  the 
probability  distribution  of  the  visible  states  and  a  target 
distribution (Hinton and Sejnowski, 1986). The relative
entropy  cost  function  can  be  rewritten  to  reveal  that, 
if  the  target  distribution  is  taken  to  be  the  empirical 
distribution  of  the  data,  it  is  equivalent  to  the  model 
likelihood.  Therefore,  the  Boltzmann  learning  rule  im¬ 
plements  maximum  likelihood  density  estimation  over 
binary  variables. 

The  Boltzmann  learning  procedure  first  estimates  cor¬ 
relations  between  unit  activities  in  a  stage  where  both 
input  and  target  units  are  clamped  and  in  a  stage  where 
the target units are unclamped. These correlations are
then used to modify the parameters of the network in
the direction of the negative relative entropy cost gradient. This
moves the output unit distribution in the unclamped
phase closer to the target distribution in the clamped
phase. 

Reformulated  in  terms  of  maximum  likelihood  condi¬ 
tional  density  estimation,  the  Boltzmann  learning  rule 
is  an  instance  of  the  generalized  EM  algorithm  (GEM; 
Dempster,  Laird,  and  Rubin,  1977):  the  estimation  of 
the  unit  correlations  given  the  current  weights  and  the 
clamped values corresponds to the E-step, and the update
of the weights corresponds to the M-step (Hinton
and  Sejnowski,  1986).  It  is  generalized  EM  in  the  sense 
that  the  M-step  does  not  actually  maximize  the  likeli¬ 
hood  but  simply  increases  it  by  gradient  ascent. 

The  incomplete  variables  in  the  Boltzmann  machine 
are  the  states  of  the  hidden  units — those  that  are  not 
denoted  as  the  visible  input  or  output  units.  This  sug¬ 
gests  that  the  principled  way  of  handling  missing  inputs 
or  targets  in  a  Boltzmann  machine  is  to  treat  them  as 
hidden units, that is, to leave them unclamped. Exactly
as in the formulation for mixture models presented
above,  the  EM  algorithm  will  then  estimate  the  appro¬ 
priate  sufficient  statistics — the  first  order  correlations — 
in  the  E-step.  These  sufficient  statistics  will  then  be  used 
to  increase  the  model  likelihood  in  the  M-step. 
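
For a network small enough to enumerate, the idea of leaving missing units unclamped can be checked directly. The sketch below computes exact pairwise correlations with a chosen subset of units clamped, and forms the clamped-minus-free difference that drives the weight update. It assumes 0/1 units, temperature 1, no bias terms, and energy $-\tfrac{1}{2} s^T W s$; the function names are hypothetical, and a realistic implementation would estimate the correlations by sampling rather than enumeration.

```python
import itertools
import numpy as np

def correlations(W, clamp):
    """Exact <s_i s_j> under the Boltzmann distribution with some units clamped.

    W     : (n, n) symmetric weight matrix with zero diagonal
    clamp : dict {unit index: 0 or 1}; every other unit is left free
    """
    n = len(W)
    free = [i for i in range(n) if i not in clamp]
    corr, Z = np.zeros((n, n)), 0.0
    for values in itertools.product([0, 1], repeat=len(free)):
        s = np.zeros(n)
        for i, v in clamp.items():
            s[i] = v
        for i, v in zip(free, values):
            s[i] = v
        p = np.exp(0.5 * s @ W @ s)   # unnormalized probability of this state
        corr += p * np.outer(s, s)
        Z += p
    return corr / Z

def boltzmann_gradient(W, observed):
    """Clamped-minus-free correlations for one pattern; unobserved units stay free."""
    g = correlations(W, observed) - correlations(W, {})
    np.fill_diagonal(g, 0.0)          # no self-connections
    return g
```

A missing input or target is handled simply by omitting its index from `observed`, so that it is summed over in the clamped phase exactly like a hidden unit.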

7  Conclusions 

There are several ways of handling missing data during
learning. Heuristics, such as filling in the missing
data with unconditional or conditional means, are not always
efficient, discarding information latent in the data
set. More principled statistical approaches yield interpretable
results, providing a guarantee of finding the maximum
likelihood parameters despite the missing data.

These statistical approaches argue convincingly that
the missing data has to be integrated out using an estimate
of the data density. One class of models in which
this can be performed naturally and efficiently is that of mixture
models. For these models, we have described applications
to clustering, function approximation, and classification
from real and discrete data. In particular, we
have shown how missing inputs and targets can be incorporated
into the mixture model framework, essentially
by making a dual use of the ubiquitous EM algorithm.

Finally, our principal conclusion is that virtually all of
the incomplete data techniques reviewed from the neural
network and machine learning literatures can be placed
within this basic statistical framework.

* From a strictly Bayesian point of view both procedures
(Bayesian backpropagation and that of Tresp et al., 1994)
are improper in that they do not take into account the
variability of the parameters in the integration.